1. Business Understanding (Project Name) --> Domain Knowledge --> Domain Expert
2. Data Requirement (What data is required to complete the project)
3. Data Collection (From Different sources and with Different Tools and Technologies)
4. Data Preparation (EDA/Data Preparation/Data Cleaning/Data Munging)
5. Machine Learning / Predictive Analysis / Data Modelling / Data Mining (Clean Data + Algorithms = Model)
Generalised Model = a model that works well on unseen data
6. Model Evaluation -Evaluation metrics (Test Model)
7. Model Tuning
8. Model Deployment
9. Monitoring
EDA/ Data Preparation/Data Cleaning Steps
1. Removing Duplicate data
2. Missing Value Treatment
6 Methods
3. Outlier Treatment
5 Methods
4. Categorical to Numerical Conversion
5. Numerical to Categorical Conversion(Binning)
6. Feature Scaling
7. Feature Transformation
8. Feature selection
1 0 to 1
2 1000 to 2000
3 100000 to 200000
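The three ranges listed above (0 to 1, 1000 to 2000, 100000 to 200000) read like columns on very different scales, which is the usual motivation for feature scaling. The sketch below applies min-max scaling to a toy frame; the column names `a`, `b`, `c` and the helper `min_max_scale` are hypothetical and not part of the notes.

```python
import pandas as pd

# Hypothetical toy columns whose ranges mirror the three ranges in the notes.
toy = pd.DataFrame({
    "a": [0.0, 0.5, 1.0],
    "b": [1000.0, 1500.0, 2000.0],
    "c": [100000.0, 150000.0, 200000.0],
})


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the 0-1 range using (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())


scaled = min_max_scale(toy)
print(scaled)
```

After scaling, all three columns share the same 0 to 1 range, so no single column dominates a distance-based algorithm purely because of its units.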
Visualization: Story Telling
1. Univariate (1 column) Analysis: Histogram / Boxplot
2. Bivariate (2 columns) Analysis: Line plot / Scatter plot / Bar plot, etc.
3. Multivariate (3 or more columns) Analysis: Heatmap (correlation analysis) / Pairplot (multiple scatter plots)
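The three kinds of analysis above can be sketched with one plot each. This is a minimal illustration on synthetic data, not the movie dataset; the frame `demo` and its columns `x`, `y`, `z` are assumptions, and the heatmap is drawn with plain matplotlib rather than seaborn.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: y depends on x, z is independent noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=100)})
demo["y"] = 2 * demo["x"] + rng.normal(scale=0.5, size=100)
demo["z"] = rng.normal(size=100)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Univariate: distribution of a single column.
axes[0].hist(demo["x"])
axes[0].set_title("Univariate: histogram")

# Bivariate: relationship between two columns.
axes[1].scatter(demo["x"], demo["y"])
axes[1].set_title("Bivariate: scatter plot")

# Multivariate: correlation matrix of all columns drawn as a heatmap.
corr_demo = demo.corr()
image = axes[2].imshow(corr_demo.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr_demo.columns)))
axes[2].set_xticklabels(corr_demo.columns)
axes[2].set_yticks(range(len(corr_demo.columns)))
axes[2].set_yticklabels(corr_demo.columns)
axes[2].set_title("Multivariate: correlation heatmap")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
```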
When to use what
1.
1. Import Libraries
2. Reading Dataset
3. Data Preparation
4. Data Visualization
Clean Data (Input and Output)
5. Split the Data into Training and Testing
6. Apply the Algorithm to the Training Data
7. Predict on test Data
8. Model Evaluation
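Steps 5 to 8 above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic, the columns `f1`, `f2`, `target` are hypothetical, and logistic regression with accuracy is one possible choice of algorithm and metric, not necessarily the one used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 2-4 stand-in: a synthetic dataset that is already clean.
rng = np.random.default_rng(42)
data = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
data["target"] = (data["f1"] + data["f2"] > 0).astype(int)

X = data[["f1", "f2"]]  # input features
y = data["target"]      # output label

# Step 5: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 6: fit the algorithm on the training data only.
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 7: predict on the unseen test data.
y_pred = model.predict(X_test)

# Step 8: compare predictions with the true test labels.
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.2f}")
```

Evaluating on rows the model never saw during training is what gives an estimate of how well it generalises.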
# Importing all necessary libraries: data science packages
import numpy as np # numpy used for mathematical operation on array
import pandas as pd # pandas used for data manipulation on dataframe
import seaborn as sns # seaborn used for data visualization
import matplotlib.pyplot as plt # matplotlib used for data visualization
# Read the data with pandas
df = pd.read_csv("Movie_classification.csv", header=0)
df
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | 527367 | YES | 109.60 | 223.840 | Thriller | 23 | 494 | 48000 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | 494055 | NO | 146.64 | 243.456 | Drama | 42 | 462 | 43200 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | 547051 | NO | 147.88 | 2022.400 | Comedy | 38 | 458 | 69400 | 1 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | 516279 | YES | 185.36 | 225.344 | Drama | 45 | 472 | 66800 | 1 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | 531448 | NO | 176.48 | 225.792 | Drama | 55 | 395 | 72400 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 21.2526 | 78.86 | 0.427 | 36624.115 | 142.6 | 8.680 | 8.775 | 8.620 | 8.970 | 6.80 | 492480 | NO | 186.96 | 243.584 | Action | 27 | 561 | 44800 | 0 |
| 502 | 20.9054 | 78.86 | 0.427 | 33996.600 | 150.2 | 8.780 | 8.945 | 8.770 | 8.930 | 7.80 | 482875 | YES | 132.24 | 263.296 | Action | 20 | 600 | 41200 | 0 |
| 503 | 21.2152 | 78.86 | 0.427 | 38751.680 | 164.5 | 8.830 | 8.970 | 8.855 | 9.010 | 7.80 | 532239 | NO | 109.56 | 243.824 | Comedy | 31 | 576 | 47800 | 0 |
| 504 | 22.1918 | 78.86 | 0.427 | 37740.670 | 162.8 | 8.730 | 8.845 | 8.800 | 8.845 | 6.80 | 496077 | YES | 158.80 | 303.520 | Comedy | 47 | 607 | 44000 | 0 |
| 505 | 20.9482 | 78.86 | 0.427 | 33496.650 | 154.3 | 8.640 | 8.880 | 8.680 | 8.790 | 6.80 | 518438 | YES | 205.60 | 203.040 | Comedy | 45 | 604 | 38000 | 0 |
506 rows × 19 columns
# Reading first 5 Rows of the data
df.head()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | 527367 | YES | 109.60 | 223.840 | Thriller | 23 | 494 | 48000 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | 494055 | NO | 146.64 | 243.456 | Drama | 42 | 462 | 43200 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | 547051 | NO | 147.88 | 2022.400 | Comedy | 38 | 458 | 69400 | 1 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | 516279 | YES | 185.36 | 225.344 | Drama | 45 | 472 | 66800 | 1 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | 531448 | NO | 176.48 | 225.792 | Drama | 55 | 395 | 72400 | 1 |
# Reading last 5 Rows of the data
df.tail()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 21.2526 | 78.86 | 0.427 | 36624.115 | 142.6 | 8.68 | 8.775 | 8.620 | 8.970 | 6.8 | 492480 | NO | 186.96 | 243.584 | Action | 27 | 561 | 44800 | 0 |
| 502 | 20.9054 | 78.86 | 0.427 | 33996.600 | 150.2 | 8.78 | 8.945 | 8.770 | 8.930 | 7.8 | 482875 | YES | 132.24 | 263.296 | Action | 20 | 600 | 41200 | 0 |
| 503 | 21.2152 | 78.86 | 0.427 | 38751.680 | 164.5 | 8.83 | 8.970 | 8.855 | 9.010 | 7.8 | 532239 | NO | 109.56 | 243.824 | Comedy | 31 | 576 | 47800 | 0 |
| 504 | 22.1918 | 78.86 | 0.427 | 37740.670 | 162.8 | 8.73 | 8.845 | 8.800 | 8.845 | 6.8 | 496077 | YES | 158.80 | 303.520 | Comedy | 47 | 607 | 44000 | 0 |
| 505 | 20.9482 | 78.86 | 0.427 | 33496.650 | 154.3 | 8.64 | 8.880 | 8.680 | 8.790 | 6.8 | 518438 | YES | 205.60 | 203.040 | Comedy | 45 | 604 | 38000 | 0 |
# Reading one random row of the data
df.sample()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.57 | 7.495 | 7.515 | 7.44 | 547051 | NO | 147.88 | 2022.4 | Comedy | 38 | 458 | 69400 | 1 |
# Reading random 5 Rows of the data
df.sample(5)
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 39.1154 | 71.28 | 0.462 | 33591.085 | 162.3 | 7.750 | 7.800 | 7.70 | 7.845 | 6.80 | 406866 | YES | 173.92 | 262.368 | Drama | 20 | 488 | 29600 | 0 |
| 108 | 22.5604 | 72.12 | 0.480 | 35963.070 | 170.6 | 8.640 | 8.910 | 8.73 | 8.850 | 7.82 | 473768 | NO | 171.92 | 203.168 | Action | 48 | 502 | 39600 | 1 |
| 111 | 22.0168 | 75.02 | 0.453 | 37301.825 | 155.1 | 8.595 | 8.685 | 8.50 | 8.870 | 8.44 | 495560 | YES | 113.12 | 263.648 | Thriller | 34 | 544 | 45600 | 0 |
| 33 | 43.0344 | 71.28 | 0.462 | 31669.055 | 168.5 | 8.100 | 8.165 | 8.02 | 8.140 | 7.80 | 394967 | YES | 167.24 | 302.096 | Comedy | 25 | 571 | 26200 | 0 |
| 259 | 33.1330 | 62.94 | 0.353 | 38007.310 | 173.5 | 8.900 | 9.090 | 8.84 | 9.145 | 8.40 | 483741 | YES | 146.04 | 224.816 | Thriller | 40 | 599 | 60200 | 0 |
# Checking the shape of the data
df.shape
(506, 19)
# Checking the number of rows in the data
df.shape[0]
506
# Checking the number of columns in the data
df.shape[1]
19
#Reading the name of the columns
df.columns
Index(['Marketing expense', 'Production expense', 'Multiplex coverage',
'Budget', 'Movie_length', 'Lead_ Actor_Rating', 'Lead_Actress_rating',
'Director_rating', 'Producer_rating', 'Critic_rating', 'Trailer_views',
'3D_available', 'Time_taken', 'Twitter_hastags', 'Genre',
'Avg_age_actors', 'Num_multiplex', 'Collection', 'Tech_Oscar'],
dtype='object')
df.rename(columns={'Genre':'Genre_imp'})
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre_imp | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | 527367 | YES | 109.60 | 223.840 | Thriller | 23 | 494 | 48000 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | 494055 | NO | 146.64 | 243.456 | Drama | 42 | 462 | 43200 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | 547051 | NO | 147.88 | 2022.400 | Comedy | 38 | 458 | 69400 | 1 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | 516279 | YES | 185.36 | 225.344 | Drama | 45 | 472 | 66800 | 1 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | 531448 | NO | 176.48 | 225.792 | Drama | 55 | 395 | 72400 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 21.2526 | 78.86 | 0.427 | 36624.115 | 142.6 | 8.680 | 8.775 | 8.620 | 8.970 | 6.80 | 492480 | NO | 186.96 | 243.584 | Action | 27 | 561 | 44800 | 0 |
| 502 | 20.9054 | 78.86 | 0.427 | 33996.600 | 150.2 | 8.780 | 8.945 | 8.770 | 8.930 | 7.80 | 482875 | YES | 132.24 | 263.296 | Action | 20 | 600 | 41200 | 0 |
| 503 | 21.2152 | 78.86 | 0.427 | 38751.680 | 164.5 | 8.830 | 8.970 | 8.855 | 9.010 | 7.80 | 532239 | NO | 109.56 | 243.824 | Comedy | 31 | 576 | 47800 | 0 |
| 504 | 22.1918 | 78.86 | 0.427 | 37740.670 | 162.8 | 8.730 | 8.845 | 8.800 | 8.845 | 6.80 | 496077 | YES | 158.80 | 303.520 | Comedy | 47 | 607 | 44000 | 0 |
| 505 | 20.9482 | 78.86 | 0.427 | 33496.650 | 154.3 | 8.640 | 8.880 | 8.680 | 8.790 | 6.80 | 518438 | YES | 205.60 | 203.040 | Comedy | 45 | 604 | 38000 | 0 |
506 rows × 19 columns
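Note that `rename` returns a new DataFrame and leaves the original untouched unless the result is assigned back, which is why later cells still show the column as `Genre`. A small sketch on a hypothetical two-column frame (`small` is not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature frame standing in for df.
small = pd.DataFrame({"Genre": ["Drama", "Comedy"], "Budget": [1.0, 2.0]})

# rename() returns a renamed copy; `small` itself is not modified.
renamed = small.rename(columns={"Genre": "Genre_imp"})

print(list(small.columns))    # original column names
print(list(renamed.columns))  # renamed column names
```

To keep the new name, assign the result back, for example `small = small.rename(columns={"Genre": "Genre_imp"})`.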
df.dtypes
Marketing expense      float64
Production expense     float64
Multiplex coverage     float64
Budget                 float64
Movie_length           float64
Lead_ Actor_Rating     float64
Lead_Actress_rating    float64
Director_rating        float64
Producer_rating        float64
Critic_rating          float64
Trailer_views            int64
3D_available            object
Time_taken             float64
Twitter_hastags        float64
Genre                   object
Avg_age_actors           int64
Num_multiplex            int64
Collection               int64
Tech_Oscar               int64
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64
 11  3D_available         506 non-null    object
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object
 15  Avg_age_actors       506 non-null    int64
 16  Num_multiplex        506 non-null    int64
 17  Collection           506 non-null    int64
 18  Tech_Oscar           506 non-null    int64
dtypes: float64(12), int64(5), object(2)
memory usage: 75.2+ KB
df.isnull()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 502 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 503 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 504 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 505 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
506 rows × 19 columns
df.isnull().sum()
Marketing expense       0
Production expense      0
Multiplex coverage      0
Budget                  0
Movie_length            0
Lead_ Actor_Rating      0
Lead_Actress_rating     0
Director_rating         0
Producer_rating         0
Critic_rating           0
Trailer_views           0
3D_available            0
Time_taken             12
Twitter_hastags         0
Genre                   0
Avg_age_actors          0
Num_multiplex           0
Collection              0
Tech_Oscar              0
dtype: int64
Missing value treatment for numerical columns
Mean / Median / Mode imputation ----> data that is not a time series
Algorithm-based imputation
Time series ----> ffill, bfill, interpolation
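The options in the notes above can be compared on a tiny example. The series `values` and `genres` below are hypothetical. Using the mode for a text column is an added illustration, and algorithm-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with two gaps.
values = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Not a time series: fill every gap with one summary statistic.
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

# Time series: fill each gap from its neighbours, so row order matters.
forward_filled = values.ffill()       # carry the previous value forward
backward_filled = values.bfill()      # pull the next value backward
interpolated = values.interpolate()   # linear interpolation by default

# The mode (most frequent value) also works for text columns.
genres = pd.Series(["Drama", None, "Drama", "Comedy"])
mode_filled = genres.fillna(genres.mode()[0])

print(interpolated.tolist())
```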
df.Budget
0 36524.125
1 35668.655
2 39912.675
3 38873.890
4 39701.585
...
501 36624.115
502 33996.600
503 38751.680
504 37740.670
505 33496.650
Name: Budget, Length: 506, dtype: float64
df["Budget"]
0 36524.125
1 35668.655
2 39912.675
3 38873.890
4 39701.585
...
501 36624.115
502 33996.600
503 38751.680
504 37740.670
505 33496.650
Name: Budget, Length: 506, dtype: float64
df.Genre
0 Thriller
1 Drama
2 Comedy
3 Drama
4 Drama
...
501 Action
502 Action
503 Comedy
504 Comedy
505 Comedy
Name: Genre, Length: 506, dtype: object
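Dot access and bracket access return the same column, as the two `Budget` outputs above show. Dot access only works when the column name is a valid Python identifier, so a name with a space such as `Marketing expense` needs brackets. A short sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical miniature frame with one simple name and one name containing a space.
frame = pd.DataFrame({"Budget": [1.0, 2.0], "Marketing expense": [3.0, 4.0]})

# Both styles return the same Series for an identifier-like name.
same = frame.Budget.equals(frame["Budget"])

# `frame.Marketing expense` would be a syntax error, so brackets are required.
marketing = frame["Marketing expense"]

# A list of names inside the brackets selects several columns as a DataFrame.
subset = frame[["Budget", "Marketing expense"]]

print(same)
print(marketing.tolist())
```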
plt.scatter(df['Marketing expense'],df['Production expense'])
<matplotlib.collections.PathCollection at 0x39c16f8460>
# Creating the Data Dictionary with first column being datatype.
Data_dict = pd.DataFrame(df.dtypes)
Data_dict
| | 0 |
|---|---|
| Marketing expense | float64 |
| Production expense | float64 |
| Multiplex coverage | float64 |
| Budget | float64 |
| Movie_length | float64 |
| Lead_ Actor_Rating | float64 |
| Lead_Actress_rating | float64 |
| Director_rating | float64 |
| Producer_rating | float64 |
| Critic_rating | float64 |
| Trailer_views | int64 |
| 3D_available | object |
| Time_taken | float64 |
| Twitter_hastags | float64 |
| Genre | object |
| Avg_age_actors | int64 |
| Num_multiplex | int64 |
| Collection | int64 |
| Tech_Oscar | int64 |
# identifying the missing values from the dataset.
Data_dict['MissingVal'] = df.isnull().sum()
Data_dict
| | 0 | MissingVal |
|---|---|---|
| Marketing expense | float64 | 0 |
| Production expense | float64 | 0 |
| Multiplex coverage | float64 | 0 |
| Budget | float64 | 0 |
| Movie_length | float64 | 0 |
| Lead_ Actor_Rating | float64 | 0 |
| Lead_Actress_rating | float64 | 0 |
| Director_rating | float64 | 0 |
| Producer_rating | float64 | 0 |
| Critic_rating | float64 | 0 |
| Trailer_views | int64 | 0 |
| 3D_available | object | 0 |
| Time_taken | float64 | 12 |
| Twitter_hastags | float64 | 0 |
| Genre | object | 0 |
| Avg_age_actors | int64 | 0 |
| Num_multiplex | int64 | 0 |
| Collection | int64 | 0 |
| Tech_Oscar | int64 | 0 |
# Counting the unique values. nunique() returns the number of distinct non-null values in each column.
Data_dict['UniqueVal'] = df.nunique()
Data_dict
| | 0 | MissingVal | UniqueVal |
|---|---|---|---|
| Marketing expense | float64 | 0 | 504 |
| Production expense | float64 | 0 | 76 |
| Multiplex coverage | float64 | 0 | 81 |
| Budget | float64 | 0 | 446 |
| Movie_length | float64 | 0 | 356 |
| Lead_ Actor_Rating | float64 | 0 | 339 |
| Lead_Actress_rating | float64 | 0 | 354 |
| Director_rating | float64 | 0 | 339 |
| Producer_rating | float64 | 0 | 353 |
| Critic_rating | float64 | 0 | 74 |
| Trailer_views | int64 | 0 | 504 |
| 3D_available | object | 0 | 2 |
| Time_taken | float64 | 12 | 449 |
| Twitter_hastags | float64 | 0 | 423 |
| Genre | object | 0 | 4 |
| Avg_age_actors | int64 | 0 | 42 |
| Num_multiplex | int64 | 0 | 293 |
| Collection | int64 | 0 | 228 |
| Tech_Oscar | int64 | 0 | 2 |
# Counting the non-null values in each column.
Data_dict['Count'] = df.count()
Data_dict
| | 0 | MissingVal | UniqueVal | Count |
|---|---|---|---|---|
| Marketing expense | float64 | 0 | 504 | 506 |
| Production expense | float64 | 0 | 76 | 506 |
| Multiplex coverage | float64 | 0 | 81 | 506 |
| Budget | float64 | 0 | 446 | 506 |
| Movie_length | float64 | 0 | 356 | 506 |
| Lead_ Actor_Rating | float64 | 0 | 339 | 506 |
| Lead_Actress_rating | float64 | 0 | 354 | 506 |
| Director_rating | float64 | 0 | 339 | 506 |
| Producer_rating | float64 | 0 | 353 | 506 |
| Critic_rating | float64 | 0 | 74 | 506 |
| Trailer_views | int64 | 0 | 504 | 506 |
| 3D_available | object | 0 | 2 | 506 |
| Time_taken | float64 | 12 | 449 | 494 |
| Twitter_hastags | float64 | 0 | 423 | 506 |
| Genre | object | 0 | 4 | 506 |
| Avg_age_actors | int64 | 0 | 42 | 506 |
| Num_multiplex | int64 | 0 | 293 | 506 |
| Collection | int64 | 0 | 228 | 506 |
| Tech_Oscar | int64 | 0 | 2 | 506 |
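The same data dictionary can also be assembled in a single step, which lets the dtype column carry a readable name instead of `0`. This is a sketch on a hypothetical two-column frame; the column label `DataType` is an assumption, not a name used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing value.
frame = pd.DataFrame({
    "Time_taken": [109.6, np.nan, 147.88],
    "Genre": ["Thriller", "Drama", "Drama"],
})

# Each value is a Series indexed by column name, so they align row by row.
data_dict = pd.DataFrame({
    "DataType": frame.dtypes,
    "MissingVal": frame.isnull().sum(),
    "UniqueVal": frame.nunique(),   # distinct non-null values
    "Count": frame.count(),         # non-null values
})

print(data_dict)
```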
# view the descriptive statistics of the dataset
df.describe()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 494.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 92.270471 | 77.273557 | 0.445305 | 34911.144022 | 142.074901 | 8.014002 | 8.185613 | 8.019664 | 8.190514 | 7.810870 | 449860.715415 | 157.391498 | 260.832095 | 39.181818 | 545.043478 | 45057.707510 | 0.545455 |
| std | 172.030902 | 13.720706 | 0.115878 | 3903.038232 | 28.148861 | 1.054266 | 1.054290 | 1.059899 | 1.049601 | 0.659699 | 68917.763145 | 31.295161 | 104.779133 | 12.513697 | 106.332889 | 18364.351764 | 0.498422 |
| min | 20.126400 | 55.920000 | 0.129000 | 19781.355000 | 76.400000 | 3.840000 | 4.035000 | 3.840000 | 4.030000 | 6.600000 | 212912.000000 | 0.000000 | 201.152000 | 3.000000 | 333.000000 | 10000.000000 | 0.000000 |
| 25% | 21.640900 | 65.380000 | 0.376000 | 32693.952500 | 118.525000 | 7.316250 | 7.503750 | 7.296250 | 7.507500 | 7.200000 | 409128.000000 | 132.300000 | 223.796000 | 28.000000 | 465.000000 | 34050.000000 | 0.000000 |
| 50% | 25.130200 | 74.380000 | 0.462000 | 34488.217500 | 151.000000 | 8.307500 | 8.495000 | 8.312500 | 8.465000 | 7.960000 | 462460.000000 | 160.000000 | 254.400000 | 39.000000 | 535.500000 | 42400.000000 | 1.000000 |
| 75% | 93.541650 | 91.200000 | 0.551000 | 36793.542500 | 167.575000 | 8.865000 | 9.030000 | 8.883750 | 9.030000 | 8.260000 | 500247.500000 | 181.890000 | 283.416000 | 50.000000 | 614.750000 | 50000.000000 | 1.000000 |
| max | 1799.524000 | 110.480000 | 0.615000 | 48772.900000 | 173.500000 | 9.435000 | 9.540000 | 9.425000 | 9.635000 | 9.400000 | 567784.000000 | 217.520000 | 2022.400000 | 60.000000 | 868.000000 | 100000.000000 | 1.000000 |
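In the summary above, `Marketing expense` has a 75th percentile of 93.54 but a maximum of 1799.52, which suggests outliers. One common screen is the 1.5 x IQR rule, sketched below on hypothetical values. The notes do not name their five outlier methods, so treat this as one possible choice; the helper `iqr_bounds` is not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR below Q1 and above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical spend values with one extreme entry.
spend = pd.Series([20.0, 21.0, 22.0, 25.0, 30.0, 1800.0])

lower, upper = iqr_bounds(spend)

# Flag values outside the fences.
outliers = spend[(spend < lower) | (spend > upper)]

# One possible treatment: cap values at the fences instead of dropping rows.
capped = spend.clip(lower=lower, upper=upper)

print(lower, upper)
print(outliers.tolist())
```

Applied to the quartiles reported above (21.6409 and 93.54165), the same rule would put the upper fence at about 201.39.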
plt.hist(df['Marketing expense'])
(array([439., 44., 14., 1., 3., 2., 0., 1., 1., 1.]),
array([ 20.1264 , 198.06616, 376.00592, 553.94568, 731.88544,
909.8252 , 1087.76496, 1265.70472, 1443.64448, 1621.58424,
1799.524 ]),
<BarContainer object of 10 artists>)
# get descriptive statistics on "number" datatypes
df.describe(include = ['number'])
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 494.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 92.270471 | 77.273557 | 0.445305 | 34911.144022 | 142.074901 | 8.014002 | 8.185613 | 8.019664 | 8.190514 | 7.810870 | 449860.715415 | 157.391498 | 260.832095 | 39.181818 | 545.043478 | 45057.707510 | 0.545455 |
| std | 172.030902 | 13.720706 | 0.115878 | 3903.038232 | 28.148861 | 1.054266 | 1.054290 | 1.059899 | 1.049601 | 0.659699 | 68917.763145 | 31.295161 | 104.779133 | 12.513697 | 106.332889 | 18364.351764 | 0.498422 |
| min | 20.126400 | 55.920000 | 0.129000 | 19781.355000 | 76.400000 | 3.840000 | 4.035000 | 3.840000 | 4.030000 | 6.600000 | 212912.000000 | 0.000000 | 201.152000 | 3.000000 | 333.000000 | 10000.000000 | 0.000000 |
| 25% | 21.640900 | 65.380000 | 0.376000 | 32693.952500 | 118.525000 | 7.316250 | 7.503750 | 7.296250 | 7.507500 | 7.200000 | 409128.000000 | 132.300000 | 223.796000 | 28.000000 | 465.000000 | 34050.000000 | 0.000000 |
| 50% | 25.130200 | 74.380000 | 0.462000 | 34488.217500 | 151.000000 | 8.307500 | 8.495000 | 8.312500 | 8.465000 | 7.960000 | 462460.000000 | 160.000000 | 254.400000 | 39.000000 | 535.500000 | 42400.000000 | 1.000000 |
| 75% | 93.541650 | 91.200000 | 0.551000 | 36793.542500 | 167.575000 | 8.865000 | 9.030000 | 8.883750 | 9.030000 | 8.260000 | 500247.500000 | 181.890000 | 283.416000 | 50.000000 | 614.750000 | 50000.000000 | 1.000000 |
| max | 1799.524000 | 110.480000 | 0.615000 | 48772.900000 | 173.500000 | 9.435000 | 9.540000 | 9.425000 | 9.635000 | 9.400000 | 567784.000000 | 217.520000 | 2022.400000 | 60.000000 | 868.000000 | 100000.000000 | 1.000000 |
# get descriptive statistics on "object" datatypes
df.describe(include = ['object'])
| | 3D_available | Genre |
|---|---|---|
| count | 506 | 506 |
| unique | 2 | 4 |
| top | YES | Thriller |
| freq | 279 | 183 |
# mean of the numeric columns only (recent pandas versions raise an error on object columns otherwise)
df.mean(numeric_only=True)
Marketing expense          92.270471
Production expense         77.273557
Multiplex coverage          0.445305
Budget                  34911.144022
Movie_length              142.074901
Lead_ Actor_Rating          8.014002
Lead_Actress_rating         8.185613
Director_rating             8.019664
Producer_rating             8.190514
Critic_rating               7.810870
Trailer_views          449860.715415
Time_taken                157.391498
Twitter_hastags           260.832095
Avg_age_actors             39.181818
Num_multiplex             545.043478
Collection              45057.707510
Tech_Oscar                  0.545455
dtype: float64
# median of the numeric columns only
df.median(numeric_only=True)
Marketing expense          25.1302
Production expense         74.3800
Multiplex coverage          0.4620
Budget                  34488.2175
Movie_length              151.0000
Lead_ Actor_Rating          8.3075
Lead_Actress_rating         8.4950
Director_rating             8.3125
Producer_rating             8.4650
Critic_rating               7.9600
Trailer_views          462460.0000
Time_taken                160.0000
Twitter_hastags           254.4000
Avg_age_actors             39.0000
Num_multiplex             535.5000
Collection              42400.0000
Tech_Oscar                  1.0000
dtype: float64
plt.hist(df['Time_taken'])
(array([ 2., 0., 0., 0., 7., 108., 101., 106., 110., 60.]),
array([ 0. , 21.752, 43.504, 65.256, 87.008, 108.76 , 130.512,
152.264, 174.016, 195.768, 217.52 ]),
<BarContainer object of 10 artists>)
# calculating the mean of the Time_taken
df['Time_taken'].mean()
157.39149797570855
# calculating the mean of Time_taken and replacing the missing values with it
# (assigning the result back avoids the chained inplace fillna, which is deprecated in recent pandas)
df['Time_taken'] = df['Time_taken'].fillna(df['Time_taken'].mean())
# View info of Columns of the dataset such as number of entries, name of columns and data type
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64
 11  3D_available         506 non-null    object
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object
 15  Avg_age_actors       506 non-null    int64
 16  Num_multiplex        506 non-null    int64
 17  Collection           506 non-null    int64
 18  Tech_Oscar           506 non-null    int64
dtypes: float64(12), int64(5), object(2)
memory usage: 75.2+ KB
# checking the null values column wise
df.isnull().sum()
Marketing expense      0
Production expense     0
Multiplex coverage     0
Budget                 0
Movie_length           0
Lead_ Actor_Rating     0
Lead_Actress_rating    0
Director_rating        0
Producer_rating        0
Critic_rating          0
Trailer_views          0
3D_available           0
Time_taken             0
Twitter_hastags        0
Genre                  0
Avg_age_actors         0
Num_multiplex          0
Collection             0
Tech_Oscar             0
dtype: int64
#Checking the null values of complete dataset
df.isnull().sum().sum()
0
# checking the first five rows of the dataset
df.head()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | 3D_available | Time_taken | Twitter_hastags | Genre | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | 527367 | YES | 109.60 | 223.840 | Thriller | 23 | 494 | 48000 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | 494055 | NO | 146.64 | 243.456 | Drama | 42 | 462 | 43200 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | 547051 | NO | 147.88 | 2022.400 | Comedy | 38 | 458 | 69400 | 1 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | 516279 | YES | 185.36 | 225.344 | Drama | 45 | 472 | 66800 | 1 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | 531448 | NO | 176.48 | 225.792 | Drama | 55 | 395 | 72400 | 1 |
# Converting the non-numerical columns into numerical ones using the get_dummies method
df = pd.get_dummies(df,columns = ["3D_available","Genre"],drop_first = True)
df.head()
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | ... | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar | 3D_available_YES | Genre_Comedy | Genre_Drama | Genre_Thriller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | ... | 109.60 | 223.840 | 23 | 494 | 48000 | 1 | 1 | 0 | 0 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | ... | 146.64 | 243.456 | 42 | 462 | 43200 | 0 | 0 | 0 | 1 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | ... | 147.88 | 2022.400 | 38 | 458 | 69400 | 1 | 0 | 1 | 0 | 0 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | ... | 185.36 | 225.344 | 45 | 472 | 66800 | 1 | 1 | 0 | 1 | 0 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | ... | 176.48 | 225.792 | 55 | 395 | 72400 | 1 | 0 | 0 | 1 | 0 |
5 rows × 21 columns
# checking the number of rows and columns after converting complete dataset into numerical
df.shape
(506, 21)
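The column count rises from 19 to 21 because each encoded column becomes one indicator per category, and `drop_first=True` then drops the first category of each: `NO` for `3D_available` and `Action` for `Genre`. The sketch below shows this on a hypothetical three-row frame. `dtype=int` is passed so the indicators print as 0/1, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical miniature frame with the two categorical columns.
movies = pd.DataFrame({
    "3D_available": ["YES", "NO", "NO"],
    "Genre": ["Thriller", "Drama", "Action"],
})

# One indicator column per category, minus the first category of each column.
encoded = pd.get_dummies(
    movies,
    columns=["3D_available", "Genre"],
    drop_first=True,
    dtype=int,
)

print(encoded)
```

A row that belongs to both dropped categories (`NO` and `Action`) is encoded as all zeros, so no information is lost by dropping them.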
# A pairplot of 21 columns draws a 21 x 21 grid of panels
21*21
441
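With 441 panels, a pairplot of every column is hard to read. One complementary option is to list only the strongly correlated column pairs. The sketch below does this on synthetic data; the column names `rating_a`, `rating_b`, `noise` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: rating_b is almost a copy of rating_a, noise is unrelated.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
frame = pd.DataFrame({
    "rating_a": base,
    "rating_b": base + rng.normal(scale=0.05, size=200),
    "noise": rng.normal(size=200),
})

corr_matrix = frame.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(upper_mask).stack()

# Pairs whose absolute correlation exceeds the (assumed) 0.9 threshold.
strong = pairs[pairs.abs() > 0.9].sort_values(ascending=False)

print(strong)
```

Pairs flagged this way are candidates for dropping one of the two columns during feature selection, since they carry nearly the same information.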
# Visualizing the pairplot of the complete dataset
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x66f3eccb70>
# calculating the correlation of complete dataset
corr = df.corr()
corr
| | Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | ... | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | Tech_Oscar | 3D_available_YES | Genre_Comedy | Genre_Drama | Genre_Thriller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Marketing expense | 1.000000 | 0.406583 | -0.420972 | -0.219247 | 0.352734 | 0.380050 | 0.379813 | 0.380069 | 0.376462 | -0.184985 | ... | 0.025694 | 0.013518 | 0.059204 | 0.383298 | -0.389582 | -0.013417 | -0.086805 | 0.066796 | -0.016894 | -0.037123 |
| Production expense | 0.406583 | 1.000000 | -0.763651 | -0.391676 | 0.644779 | 0.706481 | 0.707956 | 0.707566 | 0.705819 | -0.251565 | ... | 0.015773 | -0.000839 | 0.055810 | 0.707559 | -0.484754 | -0.024404 | -0.115401 | 0.086958 | -0.026590 | -0.098976 |
| Multiplex coverage | -0.420972 | -0.763651 | 1.000000 | 0.302188 | -0.731470 | -0.768589 | -0.769724 | -0.769157 | -0.764873 | 0.145555 | ... | 0.035515 | 0.004882 | -0.092104 | -0.915495 | 0.429300 | -0.004017 | 0.073903 | -0.068554 | 0.046393 | 0.037772 |
| Budget | -0.219247 | -0.391676 | 0.302188 | 1.000000 | -0.240265 | -0.208464 | -0.203981 | -0.201907 | -0.205397 | 0.232361 | ... | 0.040439 | 0.030674 | -0.064694 | -0.282796 | 0.696304 | -0.027148 | 0.163774 | -0.052579 | -0.004195 | 0.046251 |
| Movie_length | 0.352734 | 0.644779 | -0.731470 | -0.240265 | 1.000000 | 0.746904 | 0.746493 | 0.747021 | 0.746707 | -0.217830 | ... | -0.019820 | 0.009380 | 0.075198 | 0.673896 | -0.377999 | 0.016291 | 0.005101 | 0.092693 | 0.003452 | -0.088609 |
| Lead_ Actor_Rating | 0.380050 | 0.706481 | -0.768589 | -0.208464 | 0.746904 | 1.000000 | 0.997905 | 0.997735 | 0.994073 | -0.169978 | ... | 0.038050 | 0.014463 | 0.036794 | 0.706331 | -0.251355 | -0.035309 | -0.025208 | 0.044592 | -0.035171 | -0.030763 |
| Lead_Actress_rating | 0.379813 | 0.707956 | -0.769724 | -0.203981 | 0.746493 | 0.997905 | 1.000000 | 0.998097 | 0.994003 | -0.165992 | ... | 0.037975 | 0.010239 | 0.038005 | 0.708257 | -0.249459 | -0.040356 | -0.020056 | 0.046974 | -0.038965 | -0.030566 |
| Director_rating | 0.380069 | 0.707566 | -0.769157 | -0.201907 | 0.747021 | 0.997735 | 0.998097 | 1.000000 | 0.994126 | -0.166638 | ... | 0.035881 | 0.010077 | 0.041470 | 0.709364 | -0.246650 | -0.035768 | -0.020195 | 0.046268 | -0.033510 | -0.033634 |
| Producer_rating | 0.376462 | 0.705819 | -0.764873 | -0.205397 | 0.746707 | 0.994073 | 0.994003 | 0.994126 | 1.000000 | -0.167003 | ... | 0.028695 | 0.005850 | 0.032542 | 0.703518 | -0.248200 | -0.043612 | -0.020022 | 0.051274 | -0.031696 | -0.033829 |
| Critic_rating | -0.184985 | -0.251565 | 0.145555 | 0.232361 | -0.217830 | -0.169978 | -0.165992 | -0.166638 | -0.167003 | 1.000000 | ... | -0.014762 | -0.023655 | -0.049797 | -0.128769 | 0.341288 | -0.001084 | 0.039235 | -0.015253 | 0.057177 | -0.037129 |
| Trailer_views | -0.443457 | -0.591657 | 0.581386 | 0.602536 | -0.589318 | -0.490267 | -0.487536 | -0.486452 | -0.487911 | 0.228641 | ... | 0.074517 | -0.006704 | -0.049726 | -0.544100 | 0.720119 | -0.075783 | 0.090664 | -0.106439 | -0.000179 | 0.109849 |
| Time_taken | 0.025694 | 0.015773 | 0.035515 | 0.040439 | -0.019820 | 0.038050 | 0.037975 | 0.035881 | 0.028695 | -0.014762 | ... | 1.000000 | -0.006382 | 0.072049 | -0.056704 | 0.110005 | -0.063753 | -0.024431 | 0.012908 | 0.049285 | -0.098138 |
| Twitter_hastags | 0.013518 | -0.000839 | 0.004882 | 0.030674 | 0.009380 | 0.014463 | 0.010239 | 0.010077 | 0.005850 | -0.023655 | ... | -0.006382 | 1.000000 | -0.004840 | 0.006255 | 0.023122 | 0.077333 | -0.066012 | 0.034407 | 0.036442 | -0.058431 |
| Avg_age_actors | 0.059204 | 0.055810 | -0.092104 | -0.064694 | 0.075198 | 0.036794 | 0.038005 | 0.041470 | 0.032542 | -0.049797 | ... | 0.072049 | -0.004840 | 1.000000 | 0.078811 | -0.047426 | 0.040581 | -0.013581 | -0.030584 | -0.015918 | -0.036611 |
| Num_multiplex | 0.383298 | 0.707559 | -0.915495 | -0.282796 | 0.673896 | 0.706331 | 0.708257 | 0.709364 | 0.703518 | -0.128769 | ... | -0.056704 | 0.006255 | 0.078811 | 1.000000 | -0.391729 | -0.004857 | -0.052262 | 0.070720 | -0.035126 | -0.048863 |
| Collection | -0.389582 | -0.484754 | 0.429300 | 0.696304 | -0.377999 | -0.251355 | -0.249459 | -0.246650 | -0.248200 | 0.341288 | ... | 0.110005 | 0.023122 | -0.047426 | -0.391729 | 1.000000 | 0.154698 | 0.182867 | -0.077478 | 0.036233 | 0.071751 |
| Tech_Oscar | -0.013417 | -0.024404 | -0.004017 | -0.027148 | 0.016291 | -0.035309 | -0.040356 | -0.035768 | -0.043612 | -0.001084 | ... | -0.063753 | 0.077333 | 0.040581 | -0.004857 | 0.154698 | 1.000000 | 0.070371 | 0.021134 | 0.061414 | -0.072842 |
| 3D_available_YES | -0.086805 | -0.115401 | 0.073903 | 0.163774 | 0.005101 | -0.025208 | -0.020056 | -0.020195 | -0.020022 | 0.039235 | ... | -0.024431 | -0.066012 | -0.013581 | -0.052262 | 0.182867 | 0.070371 | 1.000000 | 0.004617 | 0.035491 | 0.017341 |
| Genre_Comedy | 0.066796 | 0.086958 | -0.068554 | -0.052579 | 0.092693 | 0.044592 | 0.046974 | 0.046268 | 0.051274 | -0.015253 | ... | 0.012908 | 0.034407 | -0.030584 | 0.070720 | -0.077478 | 0.021134 | 0.004617 | 1.000000 | -0.323621 | -0.500192 |
| Genre_Drama | -0.016894 | -0.026590 | 0.046393 | -0.004195 | 0.003452 | -0.035171 | -0.038965 | -0.033510 | -0.031696 | 0.057177 | ... | 0.049285 | 0.036442 | -0.015918 | -0.035126 | 0.036233 | 0.061414 | 0.035491 | -0.323621 | 1.000000 | -0.366563 |
| Genre_Thriller | -0.037123 | -0.098976 | 0.037772 | 0.046251 | -0.088609 | -0.030763 | -0.030566 | -0.033634 | -0.033829 | -0.037129 | ... | -0.098138 | -0.058431 | -0.036611 | -0.048863 | 0.071751 | -0.072842 | 0.017341 | -0.500192 | -0.366563 | 1.000000 |
21 rows × 21 columns
# Visualizing the correlation heatmap of the complete dataset
sns.heatmap(corr)
<AxesSubplot:>
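The heatmap is easier to act on when the strongest pairs are listed explicitly. The sketch below uses a handful of coefficients copied from the correlation table above and keeps the pairs whose absolute correlation is at least 0.9. The 0.9 threshold and the `strong_pairs` helper are assumptions chosen for illustration; they are not part of the original notebook.

```python
# Pairwise correlations copied from the correlation table above.
corr_pairs = {
    ("Lead_ Actor_Rating", "Lead_Actress_rating"): 0.997905,
    ("Lead_ Actor_Rating", "Director_rating"): 0.997735,
    ("Lead_ Actor_Rating", "Producer_rating"): 0.994073,
    ("Lead_Actress_rating", "Director_rating"): 0.998097,
    ("Lead_Actress_rating", "Producer_rating"): 0.994003,
    ("Director_rating", "Producer_rating"): 0.994126,
    ("Multiplex coverage", "Num_multiplex"): -0.915495,
    ("Production expense", "Multiplex coverage"): -0.763651,
    ("Budget", "Collection"): 0.696304,
}


def strong_pairs(pairs, threshold=0.9):
    """Return (pair, r) tuples with |r| >= threshold, strongest first."""
    selected = [(pair, r) for pair, r in pairs.items() if abs(r) >= threshold]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)


high = strong_pairs(corr_pairs)
for (a, b), r in high:
    print(f"{a} vs {b}: {r:+.4f}")
```

With these values the four rating columns are almost interchangeable, and Num_multiplex is strongly negatively correlated with Multiplex coverage, so they are natural candidates to review during feature selection.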
# Separating the output from the dataset
X = df.loc[:,df.columns!="Tech_Oscar"]
type(X)
pandas.core.frame.DataFrame
# Checking the first five rows of the input columns
X.head()
| Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | 3D_available_YES | Genre_Comedy | Genre_Drama | Genre_Thriller | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.1264 | 59.62 | 0.462 | 36524.125 | 138.7 | 7.825 | 8.095 | 7.910 | 7.995 | 7.94 | 527367 | 109.60 | 223.840 | 23 | 494 | 48000 | 1 | 0 | 0 | 1 |
| 1 | 20.5462 | 69.14 | 0.531 | 35668.655 | 152.4 | 7.505 | 7.650 | 7.440 | 7.470 | 7.44 | 494055 | 146.64 | 243.456 | 42 | 462 | 43200 | 0 | 0 | 1 | 0 |
| 2 | 20.5458 | 69.14 | 0.531 | 39912.675 | 134.6 | 7.485 | 7.570 | 7.495 | 7.515 | 7.44 | 547051 | 147.88 | 2022.400 | 38 | 458 | 69400 | 0 | 1 | 0 | 0 |
| 3 | 20.6474 | 59.36 | 0.542 | 38873.890 | 119.3 | 6.895 | 7.035 | 6.920 | 7.020 | 8.26 | 516279 | 185.36 | 225.344 | 45 | 472 | 66800 | 1 | 0 | 1 | 0 |
| 4 | 21.3810 | 59.36 | 0.542 | 39701.585 | 127.7 | 6.920 | 7.070 | 6.815 | 7.070 | 8.26 | 531448 | 176.48 | 225.792 | 55 | 395 | 72400 | 0 | 0 | 1 | 0 |
# Checking the shape of the input dataset
X.shape
(506, 20)
# Creating output column
y = df["Tech_Oscar"]
type(y)
pandas.core.series.Series
# Checking the first five rows of the output
y.head()
0    1
1    0
2    1
3    1
4    1
Name: Tech_Oscar, dtype: int64
# Checking the number of rows of the output (a Series has a single dimension)
y.shape
(506,)
# Importing the train_test_split function
from sklearn.model_selection import train_test_split
# Separating the training and testing data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Checking first five rows of input training dataset
X_train.head()
| Marketing expense | Production expense | Multiplex coverage | Budget | Movie_length | Lead_ Actor_Rating | Lead_Actress_rating | Director_rating | Producer_rating | Critic_rating | Trailer_views | Time_taken | Twitter_hastags | Avg_age_actors | Num_multiplex | Collection | 3D_available_YES | Genre_Comedy | Genre_Drama | Genre_Thriller | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 220 | 27.1618 | 67.40 | 0.493 | 38612.805 | 162.0 | 8.485 | 8.640 | 8.485 | 8.670 | 8.52 | 480270 | 174.68 | 224.272 | 23 | 536 | 53400 | 0 | 0 | 0 | 1 |
| 71 | 23.1752 | 76.62 | 0.587 | 33113.355 | 91.0 | 7.280 | 7.400 | 7.290 | 7.455 | 8.16 | 491978 | 200.68 | 263.472 | 46 | 400 | 43400 | 0 | 0 | 0 | 0 |
| 240 | 22.2658 | 64.86 | 0.572 | 38312.835 | 127.8 | 6.755 | 6.935 | 6.800 | 6.840 | 8.68 | 470107 | 204.80 | 224.320 | 24 | 387 | 54000 | 1 | 1 | 0 | 0 |
| 6 | 21.7658 | 70.74 | 0.476 | 33396.660 | 140.1 | 7.065 | 7.265 | 7.150 | 7.400 | 8.96 | 459241 | 139.16 | 243.664 | 41 | 522 | 45800 | 1 | 0 | 0 | 1 |
| 417 | 538.8120 | 91.20 | 0.321 | 29463.720 | 162.6 | 9.135 | 9.305 | 9.095 | 9.165 | 6.96 | 302776 | 172.16 | 301.664 | 60 | 589 | 20800 | 1 | 0 | 0 | 0 |
# Checking the number of rows and columns of input training dataset
X_train.shape
(404, 20)
# Checking the number of rows and columns of input testing dataset
X_test.shape
(102, 20)
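The two shapes above follow from simple arithmetic on the 506 rows. The sketch below reproduces the counts, assuming that the test count is rounded up when `test_size` is a fraction; that assumption matches the 404/102 split shown in the output.

```python
import math

n_samples = 506   # rows in X, from X.shape
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```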
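Data
Rows = Observations/Records/Samples/Tuples
Columns = Attributes/Variables/Features
Algorithms/Computers work only with numbers
Features can sit on very different scales, which is why feature scaling is applied next:
A: 1000 to 100000
B: 0 to 10
C: 0 and 1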
# Importing StandardScaler for standardization
from sklearn.preprocessing import StandardScaler
# Fitting the StandardScaler on the training data
sc = StandardScaler().fit(X_train)
# Transforming the training data to zero mean and unit variance
X_train_std = sc.transform(X_train)
X_train_std
array([[-0.37257438, -0.70492455, 0.42487874, ..., -0.66547513,
-0.48525664, 1.3293319 ],
[-0.39709866, -0.04487755, 1.24185891, ..., -0.66547513,
-0.48525664, -0.75225758],
[-0.402693 , -0.88675963, 1.11148974, ..., 1.50268577,
-0.48525664, -0.75225758],
...,
[-0.39805586, -0.15941933, 0.0772276 , ..., -0.66547513,
-0.48525664, -0.75225758],
[-0.38842357, -0.60326872, 0.93766417, ..., -0.66547513,
-0.48525664, 1.3293319 ],
[-0.39951258, -1.01275558, 0.3988049 , ..., -0.66547513,
-0.48525664, 1.3293319 ]])
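To make the standardization step concrete, here is a minimal pure-Python sketch of the same idea for a single feature: learn the mean and the population standard deviation from the training values, then apply z = (x - mean) / std to any data. The numbers are hypothetical and not taken from the dataset, and `standardize` is an illustrative helper rather than a scikit-learn function.

```python
from statistics import fmean, pstdev

# Hypothetical values for one feature, not taken from the dataset.
train_values = [20.0, 22.0, 24.0, 26.0, 28.0]
test_values = [23.0, 30.0]

# "Fit": learn the mean and population standard deviation from training data only.
mean = fmean(train_values)
std = pstdev(train_values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in values]


train_std = standardize(train_values, mean, std)
test_std = standardize(test_values, mean, std)

print(train_std)
print(test_std)
```

The training values end up with mean 0 and standard deviation 1. The test values are scaled with the training statistics, so they are not guaranteed to have mean 0 themselves.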
Clean Data ---> Training and Testing
Clean Data (Training Data) + Algorithm = Model
Testing Data ---> Model = Prediction
Clean Data ---> Algorithm
(EDA ---> Clean Data) + Algorithm (ML/DL) = Model
# Transforming the input testing data with the scaler fitted on the training data
X_test_std = sc.transform(X_test)
X_test_std
array([[-0.40835869, -1.12872913, 0.83336883, ..., 1.50268577,
-0.48525664, -0.75225758],
[ 0.71925111, 0.9988844 , -0.65283979, ..., 1.50268577,
-0.48525664, -0.75225758],
[-0.40257488, 0.39610829, 0.05115377, ..., 1.50268577,
-0.48525664, -0.75225758],
...,
[-0.3982601 , -0.85812418, 0.89420778, ..., -0.66547513,
-0.48525664, 1.3293319 ],
[-0.39934279, -0.07637654, 0.58132175, ..., 1.50268577,
-0.48525664, -0.75225758],
[-0.40088071, -0.36702631, 0.31189212, ..., -0.66547513,
-0.48525664, -0.75225758]])
# Importing the SVM Classifier
from sklearn import svm
# Applying the SVM Classifier to training data
clf_svm_l = svm.SVC(kernel='linear', C=100)
clf_svm_l.fit(X_train_std, y_train)
SVC(C=100, kernel='linear')
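Recall = TP / (TP + FN)   (ideal value: 1)
Precision = TP / (TP + FP)   (ideal value: 1)
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative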
# Predicting on the testing data (the training-set prediction is left commented out)
#y_train_pred = clf_svm_l.predict(X_train_std)
y_test_pred = clf_svm_l.predict(X_test_std)
# Checking the predicted values
y_test_pred
array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int64)
# Importing the accuracy_score and confusion_matrix functions
from sklearn.metrics import accuracy_score, confusion_matrix
# Checking the confusion matrix
confusion_matrix(y_test, y_test_pred)
array([[25, 19],
[25, 33]], dtype=int64)
# Checking the accuracy Score
accuracy_score(y_test, y_test_pred)
0.5686274509803921
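The accuracy above can be recomputed by hand from the confusion matrix, together with precision and recall. The sketch below relies on scikit-learn's layout for labels (0, 1), where rows are true labels and columns are predicted labels, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Confusion matrix printed above: rows are true labels, columns are
# predicted labels, so for labels (0, 1) the layout is [[TN, FP], [FN, TP]].
matrix = [[25, 19],
          [25, 33]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

This gives an accuracy of 58/102, a precision of 33/52 and a recall of 33/58 for class 1, so the linear SVM with C=100 misses a large share of the positive class.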
# Checking the number of support vectors for each class
clf_svm_l.n_support_
array([144, 146])
# Importing GridSearchCV for hyperparameter optimization
from sklearn.model_selection import GridSearchCV
# Setting the candidate values of the hyperparameter C
params = {'C':(0.001,0.005,0.01,0.05, 0.1, 0.5, 1, 5, 10, 50,100,500,1000)}
# Creating an object of the SVC classifier
clf_svm_l = svm.SVC(kernel='linear')
# Setting up the GridSearchCV hyperparameter search with 10-fold cross-validation
svm_grid_lin = GridSearchCV(clf_svm_l, params, n_jobs=-1,
                            cv=10, verbose=1, scoring='accuracy')
# Fitting GridSearchCV on the training data
svm_grid_lin.fit(X_train_std, y_train)
Fitting 10 folds for each of 13 candidates, totalling 130 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 56 tasks | elapsed: 5.1s
[Parallel(n_jobs=-1)]: Done 130 out of 130 | elapsed: 2.2min finished
GridSearchCV(cv=10, estimator=SVC(kernel='linear'), n_jobs=-1,
param_grid={'C': (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50,
100, 500, 1000)},
scoring='accuracy', verbose=1)
# Checking the best parameter of SVM
svm_grid_lin.best_params_
{'C': 0.1}
# Retrieving the classifier refitted with the best parameter value
linsvm_clf = svm_grid_lin.best_estimator_
# Checking the accuracy score with the best parameter value
accuracy_score(y_test, linsvm_clf.predict(X_test_std))
0.5980392156862745
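With cv=10, each candidate value of C is trained ten times, each time holding out a different tenth of the 404 training rows for validation. The sketch below shows how plain, unshuffled k-fold splitting would divide those rows. It is a simplification: as far as I know, GridSearchCV uses a stratified variant for classifiers, so the real fold sizes may differ slightly. `k_fold_indices` is a hypothetical helper written for this illustration.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for plain, unshuffled k-fold."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=404, n_splits=10))
val_sizes = [len(val) for _, val in folds]
print(val_sizes)
```

Every row is used for validation exactly once, which is why the 13 candidates require 130 fits in total.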
# Applying SVC classifier with a polynomial kernel (degree 2, despite the p3 suffix in the variable name)
clf_svm_p3 = svm.SVC(kernel='poly', degree=2, C=0.1)
clf_svm_p3.fit(X_train_std, y_train)
SVC(C=0.1, degree=2, kernel='poly')
# Predicting the training and testing data values
y_train_pred = clf_svm_p3.predict(X_train_std)
y_test_pred = clf_svm_p3.predict(X_test_std)
# Checking the accuracy score on the testing data
accuracy_score(y_test, y_test_pred)
0.5588235294117647
# Checking the number of support vectors for each class
clf_svm_p3.n_support_
array([185, 194])
# Applying SVC Classifier with rbf Kernel
clf_svm_r = svm.SVC(kernel='rbf', gamma=0.5, C=10)
clf_svm_r.fit(X_train_std, y_train)
SVC(C=10, gamma=0.5)
# Predicting the training and testing data values
y_train_pred = clf_svm_r.predict(X_train_std)
y_test_pred = clf_svm_r.predict(X_test_std)
# Checking the accuracy score on the testing data
accuracy_score(y_test, y_test_pred)
0.6176470588235294
# Checking the number of support vectors for each class
clf_svm_r.n_support_
array([186, 218])
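The three `n_support_` outputs are easier to compare as a share of the 404 training rows. The sketch below only does arithmetic on the numbers printed above; the dictionary labels are informal names introduced here.

```python
n_train = 404  # rows in X_train, from X_train.shape

# n_support_ values printed above (support vectors per class).
n_support = {
    "linear (C=100)": (144, 146),
    "poly (degree=2, C=0.1)": (185, 194),
    "rbf (gamma=0.5, C=10)": (186, 218),
}

fractions = {name: sum(counts) / n_train for name, counts in n_support.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of training rows are support vectors")
```

For the rbf model the two counts add up to 404, so every training row is a support vector. That is often a sign that the model is memorizing the training set, which is one reason to tune gamma and C rather than trust the single setting above.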
# Setting the candidate values of the hyperparameters C and gamma
params = {'C':(0.01,0.05, 0.1, 0.5, 1, 5, 10, 50),
          'gamma':(0.001, 0.01, 0.1, 0.5, 1)}
# Applying SVC Classifier with rbf kernel
clf_svm_r = svm.SVC(kernel='rbf')
# Creating a GridSearchCV object over the candidate values with 3-fold cross-validation
svm_grid_rad = GridSearchCV(clf_svm_r, params, n_jobs=-1,
                            cv=3, verbose=1, scoring='accuracy')
# Fitting the grid search on the training data
svm_grid_rad.fit(X_train_std, y_train)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 0.7s finished
GridSearchCV(cv=3, estimator=SVC(), n_jobs=-1,
param_grid={'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
'gamma': (0.001, 0.01, 0.1, 0.5, 1)},
scoring='accuracy', verbose=1)
# Checking the best values of the hyperparameters
svm_grid_rad.best_params_
{'C': 50, 'gamma': 0.001}
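A grid search simply tries every combination of the listed values. The sketch below enumerates the same grid with `itertools.product`, reproduces the 40 candidates and 120 fits reported in the log, and checks whether the reported best values sit on the edge of the grid. The `on_edge` check is an illustrative addition, not something GridSearchCV reports.

```python
from itertools import product

# Same candidate values as the params dictionary above.
params = {'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
          'gamma': (0.001, 0.01, 0.1, 0.5, 1)}
cv = 3

# Every combination of C and gamma is one candidate.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv
print(len(candidates), total_fits)

# best_params_ reported above.
best = {'C': 50, 'gamma': 0.001}
on_edge = {name: best[name] in (min(values), max(values))
           for name, values in params.items()}
print(on_edge)
```

Both best values lie on the boundary of the grid (the largest C and the smallest gamma), which suggests that extending the grid toward larger C and smaller gamma might be worth trying.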
# Retrieving the best estimator
radsvm_clf = svm_grid_rad.best_estimator_
# Checking the accuracy score with the best hyperparameter values
accuracy_score(y_test, radsvm_clf.predict(X_test_std))
0.6176470588235294
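To put the five test accuracies in context, the sketch below converts each one back to a count of correct predictions and compares it with a majority-class baseline. The baseline is an addition for illustration; the class counts come from the row sums of the confusion matrix shown earlier, and the dictionary labels are informal names.

```python
n_test = 102  # rows in X_test, from X_test.shape

# Test accuracies printed in the cells above.
accuracies = {
    "linear, C=100": 0.5686274509803921,
    "linear, tuned C=0.1": 0.5980392156862745,
    "poly, degree=2": 0.5588235294117647,
    "rbf, gamma=0.5, C=10": 0.6176470588235294,
    "rbf, tuned": 0.6176470588235294,
}

# Row sums of the confusion matrix give the true class counts in y_test.
class_counts = {0: 25 + 19, 1: 25 + 33}
baseline = max(class_counts.values()) / n_test  # always predict the majority class

for name, acc in accuracies.items():
    correct = round(acc * n_test)
    print(f"{name}: {correct}/{n_test} correct, {acc - baseline:+.4f} vs baseline")
```

The first linear model only matches the baseline of always predicting class 1, and the best models get 63 of 102 rows right against 58 for the baseline. On a test set this small a gap of five rows is weak evidence, so the ranking of the kernels should be treated as tentative.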